
Softmax optimization for GPU #15545

Merged: 16 commits merged into apache:master on Aug 21, 2019

Conversation

@ptrendx (Member) commented Jul 15, 2019

Description

This PR optimizes the Softmax implementation for the cases where the stride is 1 and the leading dimension is small (up to 20 kB of data in that dimension).

There are 2 optimizations in this kernel compared to the previous one:

  • using longer data types for loading/writing data (up to 8 B per read/write instead of, e.g., 2 B in the case of fp16 I/O)
  • using persistent storage to reduce the number of memory accesses (the previous implementation used 3 reads and 1 write; this implementation uses 1 read and 1 write)

Compared to the previous implementation, with fp16 I/O the new kernel is up to 4x faster; a rough sketch of both ideas follows below.
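For illustration, a minimal sketch of the two ideas (wider loads plus keeping the row resident in registers). This is not the PR's actual kernel: the kernel name, the fixed 512-thread block, the 8-loads-per-thread register buffer, and the dummy scaling operation are assumptions made for the sketch.

#include <cuda_fp16.h>
#include <cstdint>

// Sketch only: each block handles one row, each thread pulls its share of the
// row through 8-byte loads into registers ("persistent storage"), applies a
// dummy per-element operation, and writes back with 8-byte stores.
// Assumes blockDim.x == kThreads, row_length divisible by 4, rows small enough
// to fit the register buffer, and pointers aligned for 8-byte access
// (cudaMalloc guarantees far more than that).
constexpr int kThreads = 512;
constexpr int kMaxLoadsPerThread = 8;

__global__ void scale_rows_vectorized(const __half* in, __half* out,
                                      int row_length, float scale) {
  using LType = uint64_t;                  // 8 B per load/store
  constexpr int kElemsPerLoad =
      static_cast<int>(sizeof(LType) / sizeof(__half));  // 4 fp16 values
  const int num_loads = row_length / kElemsPerLoad;
  const size_t row_offset = static_cast<size_t>(blockIdx.x) * num_loads;
  const LType* in_aligned = reinterpret_cast<const LType*>(in) + row_offset;
  LType* out_aligned = reinterpret_cast<LType*>(out) + row_offset;

  LType persistent[kMaxLoadsPerThread];    // this thread's slice of the row
  // Single pass over the input: 1 read per element.
  for (int i = threadIdx.x, r = 0; i < num_loads; i += kThreads, ++r) {
    persistent[r] = in_aligned[i];
  }
  // Work on the register copy, then a single pass of writes: 1 write per element.
  for (int i = threadIdx.x, r = 0; i < num_loads; i += kThreads, ++r) {
    LType packed = persistent[r];
    __half* vals = reinterpret_cast<__half*>(&packed);
    for (int j = 0; j < kElemsPerLoad; ++j) {
      vals[j] = __float2half(__half2float(vals[j]) * scale);
    }
    out_aligned[i] = packed;
  }
}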

@eric-haibin-lin

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • To the best of my knowledge, examples are either not affected by this change or have been fixed to be compatible with this change

@eric-haibin-lin eric-haibin-lin self-requested a review July 16, 2019 04:51
@karan6181 (Contributor) commented:

@mxnet-label-bot add [Operator, pr-awaiting-review]

@marcoabreu marcoabreu added the Operator and pr-awaiting-review labels on Jul 16, 2019
src/operator/nn/softmax-inl.h (review thread, outdated)
const int softmax_threads_per_block = 512;

template <typename OP, typename T>
__device__ inline T warp_reduce(T value, OP redfun) {
Member:
this looks like a generic function that can be used elsewhere. Is there a better place to put this function?

Member Author (@ptrendx):
Yup, will look into putting it in some better place.

@sxjscience (Member) commented Jul 22, 2019:

In fact, there are other common functions spread across other files, e.g. the wrappers of the warp-level primitives in https://github.com/apache/incubator-mxnet/blob/master/src/operator/nn/layer_norm.cu#L32-L50
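For reference, warp-level reductions with the signature quoted above are typically built from __shfl_down_sync; a minimal sketch, assuming a full 32-lane warp participates (not necessarily this PR's exact code):

// Sketch of a warp-wide reduction matching the signature above. Assumes all
// 32 lanes participate and T is a shuffle-compatible arithmetic type; after
// the loop, lane 0 holds the reduced value and can broadcast it to the rest
// of the warp with __shfl_sync if needed.
template <typename OP, typename T>
__device__ inline T warp_reduce(T value, OP redfun) {
  for (int offset = 16; offset > 0; offset /= 2) {
    value = redfun(value, __shfl_down_sync(0xffffffff, value, offset));
  }
  return value;
}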


// the division by zero warning generated for such invalid cases.
const int row_length = entries_per_load > 0 ? M / entries_per_load : 0;

const LType * in_aligned = reinterpret_cast<const LType *>(in);
Contributor:

nit: LType* instead of LType * as per Google Style Guide. Same for all other pointer declarations.

@sxjscience (Member) commented Jul 19, 2019:

Do we need to check the alignment? Since CUDA enforces alignment, it will potentially raise an error if the address of in is not aligned to LType. For example, DType can be float32 while LType is double.

Member Author (@ptrendx):

@sxjscience That is why the code that launches this kernel chooses LType based on the array dimensions: if the leading dimension is odd, it will not choose an LType larger than DType.

Member:

I'm not sure about the answer to this question, but my concern is that we need to make sure that the following is true:

ASSERT(reinterpret_cast<size_t>(in) % sizeof(LType) == 0)

mainly due to the force-alignment constraint in CUDA (https://stackoverflow.com/questions/37323053/misaligned-address-in-cuda).

Member Author (@ptrendx):
A pointer given by cudaMalloc is guaranteed to be aligned to something like 256B or more.

Member:

Thanks for correcting me. Another question is: "Would it be possible to handle the cases when the N in get_load_type(size_t N) cannot be divided by 8?" Could we load the majority of the elements using the vectorizing trick and just handle the remainder? (This is just a question; there is no need to address it in this PR, because I think it looks great.)
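A minimal sketch of the kind of selection being discussed, with hypothetical names (not necessarily the PR's get_load_type): the widest load type whose size evenly divides the row's byte count is chosen, so rows that are not a multiple of 8 bytes simply fall back to narrower loads instead of needing a scalar-remainder path.

#include <cstddef>

// Hypothetical helper: choose the widest vector load type that evenly divides
// the row size in bytes, so every row start stays aligned for that type.
enum class LoadWidth { k8B, k4B, k2B, k1B };

inline LoadWidth choose_load_width(size_t row_bytes) {
  if (row_bytes % 8 == 0) return LoadWidth::k8B;  // e.g. load as uint64_t
  if (row_bytes % 4 == 0) return LoadWidth::k4B;  // e.g. load as uint32_t
  if (row_bytes % 2 == 0) return LoadWidth::k2B;  // e.g. load as uint16_t
  return LoadWidth::k1B;                          // scalar fallback
}

For example, with fp16 data a row of 20 elements (40 bytes) would get 8-byte loads, while a row of 21 elements (42 bytes) would fall back to 2-byte loads.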

@ptrendx ptrendx mentioned this pull request Jul 18, 2019
@haojin2 (Contributor) commented Jul 18, 2019

@ptrendx Do you have any data on how much of a performance boost this change introduces on applicable example workloads?

@sxjscience (Member) commented Jul 19, 2019

Great! I'm considering using this kind of warp-level primitive + vectorized load to accelerate our reduce function: https://github.com/apache/incubator-mxnet/blob/master/src/operator/tensor/broadcast_reduce-inl.cuh

@ptrendx (Member Author) commented Jul 22, 2019

@haojin2 For the perf improvement on end-to-end training: I have not measured that yet, only the kernel speedup so far. BTW, how important do you think the safe accumulation option is (as in, having an option to NOT do it, instead of always doing safe accumulation)? Personally I don't see value in skipping it in softmax, as it would most probably affect accuracy, and having 3 TYPE_SWITCHes makes the compilation time quite long. You did not include the ability to skip it in softmax_with_length; should I remove it from regular softmax as well?
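For readers unfamiliar with the option being discussed: safe accumulation means accumulating reductions in a wider type than the I/O type. A rough sketch of the idea for fp16 inputs (illustrative only; the real code selects the accumulation type through TYPE_SWITCH template parameters rather than a hard-coded helper like this):

#include <cuda_fp16.h>

// Sketch: sums over fp16 inputs are accumulated in fp32 and only cast back at
// the end, which is conceptually what MXNET_SAFE_ACCUMULATION=1 enables.
__device__ inline float safe_row_sum(const __half* row, int n) {
  float acc = 0.f;                 // wider accumulator than the fp16 inputs
  for (int i = 0; i < n; ++i) {
    acc += __half2float(row[i]);
  }
  return acc;
}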

@ptrendx (Member Author) commented Aug 5, 2019

@haojin2 I'm not sure how to make progress with this PR: it seems that the Windows CI instances do not have enough RAM to process all the templates here. The problem exists even with CPU compilation, where I just fixed your omission of the MXNET_SAFE_ACCUMULATION=0 case in softmax_with_length.

@KellenSunderland (Contributor) commented:

@sxjscience Totally agree. This would provide a lot of benefit across the framework (for example the layernorm op).

@ptrendx I see what you mean: "fatal error C1002: compiler is out of heap space in pass 2". The Windows CI machines should have a fair amount of RAM, so this is a little strange.

@ptrendx (Member Author) commented Aug 9, 2019

@marcoabreu Could you give some advice on that Windows CI problem: are the Windows builder instances much different from the Unix ones?

@marcoabreu (Contributor) commented:

Yeah, they are, but the heap space error is the same old problem: we just have too many macros, I think. Basically we can't add any more operators because the file has grown too large.

Try compiling locally on Unix with optimizations disabled and intermediary output enabled. You will see some intermediary files that are multiple gigabytes in size.

@marcoabreu (Contributor) commented:

@haojin2 had the same problem btw

@marcoabreu (Contributor) commented:

But the Windows instances are c5.18xlarge; they have plenty of RAM. We're literally running into the limitations of the compilers.

@ptrendx (Member Author) commented Aug 9, 2019

Would it help if we split the forward and backward passes into different files? Those limitations are per file, right?

@marcoabreu (Contributor) commented:

I'd rather split operators into different files. Forward and backward passes kind of belong together, right?

This could also lay the groundwork for dynamic loading of operators, where we could ship operators selectively.

@KellenSunderland (Contributor) commented:

Thanks for the info Marco, appreciate you helping with this.

I remember the MSVC team was very slow in moving their compiler process to 64-bit. To be clear, they supported compiling 64-bit programs very early on, but the compiler process itself didn't need much RAM, so it stayed 32-bit. Apparently they now do have a 64-bit compiler process, but 32-bit is still the default. Is there any way we could check one of the Windows hosts and see whether the compiler process it's using (CL.exe) is running in 32-bit mode in the task manager? Are any of the compilation processes actually using more than 4 GB of RAM?

@marcoabreu (Contributor) commented:

I remember that I checked it on Unix and saw high RAM usage there. I'm currently not able to get data from Windows, but the operator compilation has always been a bottleneck.

@marcoabreu (Contributor) commented:

There really is a pattern: these kinds of errors always come up when people try to add new operators.

@larroy (Contributor) commented Aug 12, 2019

(or Newer)

@ptrendx ptrendx changed the title from "Softmax fwd optimization for GPU" to "Softmax optimization for GPU" on Aug 19, 2019
@ptrendx (Member Author) commented Aug 20, 2019

Fortunately, it seems that splitting softmax.cc into 3 files, one for each operator (softmax, softmin, and log_softmax), did the trick.

// By default temperature is 1.0.
// Adding a branch here to save the CPU 'divide-by-1' computation at runtime
DType final_result;
if (temperature == 1.0) {
Contributor:
is this micro-opt really making things better?

Contributor:

Yes, I did a performance comparison earlier. This check speeds up the operator by 30%.

Member Author (@ptrendx):

I did not touch the CPU code, apart from merging it from 2 functions into 1, so as not to introduce slowdowns in that path.
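To make the branch in the snippet above concrete, a sketch of the pattern; std::exp and the row_max / row_sum names are illustrative stand-ins, not the PR's exact code.

#include <cmath>

// Sketch of the micro-optimization under discussion: skip the per-element
// division when temperature is 1.0 (the default). row_max and row_sum stand
// for the usual stabilized-softmax intermediates and are assumed to be
// computed consistently with whichever branch is taken.
template <typename DType>
DType softmax_element(DType x, DType row_max, DType row_sum, double temperature) {
  DType final_result;
  if (temperature == 1.0) {
    // fast path: no per-element division by the temperature
    final_result = std::exp(x - row_max) / row_sum;
  } else {
    final_result = std::exp((x - row_max) / static_cast<DType>(temperature)) / row_sum;
  }
  return final_result;
}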

@@ -301,7 +282,7 @@ __global__ void softmax_compute_kernel(DType *in, OType *out, index_t M, int axi

red::sum::SetInitValue(smem[x]);
Contributor:

Does having multiple max values affect numerical accuracy? Or are they reduced at some other point to a final max?

Member Author (@ptrendx):

Not sure I understand. The plain definition is

softmax(x_i) = exp(x_i) / sum_j exp(x_j)

That is not very numerically stable, so there is another step introduced that finds the maximum x_i first and computes

softmax(x_i) = exp(x_i - x_max) / sum_j exp(x_j - x_max)

It does not matter which maximum you take.

Contributor:

Let me rephrase: when running Reduce1D, are all the max values reduced into smem[0] / smax? As I understand it, x_max should be the max of {x_i}; it actually matters for numerical issues.

Member Author (@ptrendx):
Yes, those intermediate max values are then reduced across threads to the final maximum.

in, out, M, axis, sshape, stride, temperature);
MSHADOW_CUDA_POST_KERNEL_CHECK(softmax_compute_kernel);
}
DType my_max_value;
Contributor:

Can we add a comment or maybe a more descriptive name? Is this the max of the stride?

Member Author (@ptrendx):
What would you suggest? This is the maximum value that this thread sees.
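Putting the two threads above together, a sketch of the two-stage maximum being discussed. All names are illustrative, and the second stage below uses a plain shared-memory tree reduction for clarity instead of the PR's warp_reduce helpers.

#include <math_constants.h>   // CUDART_INF_F

// Assumes a launch with blockDim.x == kThreadsPerBlock (a power of two) and
// one block per row; DType is fixed to float for simplicity.
constexpr int kThreadsPerBlock = 512;

__global__ void row_max_kernel(const float* in, float* out, int row_length) {
  __shared__ float smem[kThreadsPerBlock];
  const float* row = in + static_cast<size_t>(blockIdx.x) * row_length;

  // Stage 1: the maximum of the elements this thread sees.
  float my_max_value = -CUDART_INF_F;
  for (int i = threadIdx.x; i < row_length; i += blockDim.x) {
    my_max_value = fmaxf(my_max_value, row[i]);
  }

  // Stage 2: reduce the per-thread partial maxima to the row maximum.
  smem[threadIdx.x] = my_max_value;
  __syncthreads();
  for (int offset = blockDim.x / 2; offset > 0; offset /= 2) {
    if (threadIdx.x < offset) {
      smem[threadIdx.x] = fmaxf(smem[threadIdx.x], smem[threadIdx.x + offset]);
    }
    __syncthreads();
  }
  if (threadIdx.x == 0) {
    out[blockIdx.x] = smem[0];   // which thread contributed the max is irrelevant
  }
}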

@apeforest (Contributor) commented:

Thanks for refactoring the Softmax functions and merging them into one.

@larroy (Contributor) left a review:

LGTM

@eric-haibin-lin eric-haibin-lin merged commit bdeb7bc into apache:master Aug 21, 2019
Labels: Operator, pr-awaiting-review (PR is waiting for code review)

10 participants